Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix setting V1 format version for Non-REST catalogs #411

Merged

Conversation

amogh-jahagirdar
Copy link
Contributor

Currently we always set V2 format for create table for non-rest Catalogs regardless of the specified version. This change addresses that.

Comment on lines 263 to 266
schema = data.get("schema")
if schema and isinstance(schema, dict):
if "schema_id" not in data["schema"]:
data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I explicitly pass in a schema it's no longer a dictionary. Then this validator fails since data["schema"] is an acutal Schema and cannot be referenced. I'm not really familiar with Pydantic but my understanding of this validator is that it's meant for just converting the schema metadata a common representation that can be shared with V2 metadata. I think that should only happen when reading the actual JSON file.

I'm still working through another issue for partition specs in V1 (this error is thrown in the updated test):


pydantic_core._pydantic_core.ValidationError: 2 validation errors for TableMetadataV1
 partition_spec.0
Input should be a valid dictionary [type=dict_type, input_value=('spec_id', 0), input_type=tuple]
 For further information visit https://errors.pydantic.dev/2.6/v/dict_type
partition_spec.1
 Input should be a valid dictionary [type=dict_type, input_value=('fields', ()), input_type=tuple]
  For further information visit https://errors.pydantic.dev/2.6/v/dict_type

I'm not sure why this happens for V1 metadata explicitly constructed but V2 works fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @amogh-jahagirdar Yes, this is a bit awkward in Pydantic. This is a @model_validator(mode="before") which means that it will always apply the validator before trying to construct the model. This will allow you to correct certain things before applying the checks. This particular function will make sure that the schema-id is set. Where schema_id is required. Keep in mind that this also does have applied the aliases, so both schema_id and schema-id are valid.

The before-mode model validator also be applied when you initialize something off: TableMetadataV1(bla=(1,2,3))

The error that you're seeing is after the mode="before" validators are applied. I would put a breakpoint in the validator, and trace up the stack where the bogus data comes from.

tests/catalog/test_hive.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this :)

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID
schema = data.get("schema")
if schema and isinstance(schema, dict):
if "schema_id" not in data["schema"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if "schema_id" not in data["schema"]:
if "schema_id" not in schema:

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
tests/catalog/test_hive.py Outdated Show resolved Hide resolved
Comment on lines 263 to 266
schema = data.get("schema")
if schema and isinstance(schema, dict):
if "schema_id" not in data["schema"]:
data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @amogh-jahagirdar Yes, this is a bit awkward in Pydantic. This is a @model_validator(mode="before") which means that it will always apply the validator before trying to construct the model. This will allow you to correct certain things before applying the checks. This particular function will make sure that the schema-id is set. Where schema_id is required. Keep in mind that this also does have applied the aliases, so both schema_id and schema-id are valid.

The before-mode model validator also be applied when you initialize something off: TableMetadataV1(bla=(1,2,3))

The error that you're seeing is after the mode="before" validators are applied. I would put a breakpoint in the validator, and trace up the stack where the bogus data comes from.

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Feb 11, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 5 times, most recently from 8c5f085 to e3f5dcb Compare February 11, 2024 16:43
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 3 times, most recently from 4d8e486 to d0c964f Compare February 11, 2024 18:40
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this during the weekend. This looks good, I just have some concerns about setting the schema and partition-spec using the before-validator.

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
@@ -313,6 +315,34 @@ def construct_partition_specs(cls, data: Dict[str, Any]) -> Dict[str, Any]:

return data

@model_validator(mode="before")
def construct_v1_spec_from_v2_fields(cls, data: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather avoid the before validators to patch in the partition-spec and schema (probably sort-order as well?). I'd much rather see that we explicitly pass in the spec below:

    if format_version == 1:
        return TableMetadataV1(
            location=location,
            schema=fresh_schema,
            last_column_id=fresh_schema.highest_field_id,
            current_schema_id=fresh_schema.schema_id,
+           partition_spec=fresh_partition_spec,
            partition_specs=[fresh_partition_spec],
            default_spec_id=fresh_partition_spec.spec_id,
+           sort_order=fresh_sort_order,
            sort_orders=[fresh_sort_order],
            default_sort_order_id=fresh_sort_order.order_id,
            properties=properties,
            last_partition_id=fresh_partition_spec.last_assigned_field_id,
            table_uuid=table_uuid,
        )

My biggest concern is again that this is a before validator, and we don't take into account the aliases (partition-specs, partition-spec-id, etc).

It looks like Java just ignores the schema if there is a schemas: https://github.com/apache/iceberg/blob/5f577f1b9902ffe6181897a439686a31fc81b89a/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L343-L368

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it just felt a bit awkward to pass in the fields in a redundant way. As long as the source of truth is correct when reading the fields, which it is as you saw for schemas then we're good. I've updated.

Also PartitionSpec on the V1 metadata is a list[dict[str, any]], right now I work with that to preserve compatibility but that does make things a bit awkward since we have to pass in the model_dump of partition spec instead of just the normal PartitionSpec.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @HonahX had a good catch here, it's not just a dump of partition spec, the List[dict[str, any]] is a list of the fields themselves based on the spec. I updated test_metadata. Also, The tests in #245 exercise this path more so going forward so we should have more coverage.

@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch 5 times, most recently from 7b3aa35 to 9fc37fe Compare February 12, 2024 00:42
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this! @amogh-jahagirdar It's great that we no longer require extra before validator for supporting the v1 metadata. Setting the relevant fields explicitly is a smart approach

pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
pyiceberg/table/metadata.py Outdated Show resolved Hide resolved
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Feb 12, 2024

Thanks for taking this! @amogh-jahagirdar It's great that we no longer require extra before validator for supporting the v1 metadata. Setting the relevant fields explicitly is a smart approach

Thanks but I don't know about that tbh. I think it's a pretty bad API that we are somewhat forced to specify all fields even though it's redundant for V1 metadata (because it's deprecated so we want to specify the new fields, but partition-spec is still required). We should revisit that imo, but I think that can be done in a follow on PR since the goal here is to solve the original problem of not being able to create V1 tables in non-rest catalogs. Also doesn't really matter in practice too much since constructing new metadata is done behind the catalog anyways which is the API the vast majority of people will be interacting with.

@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-format-version-metastore-catalogs branch from 9fc37fe to 4dbc2de Compare February 12, 2024 04:40
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I think it's a pretty bad API that we are somewhat forced to specify all fields even though it's redundant for V1 metadata

I guess that's the case with maintaining backward compatibility :(. Fortunately, as you mentioned, this is behind new_table_metadata and catalog, thereby not requiring users to specify any redundant fields.

I thought pydantic's before validator might help, but as we saw in this PR, it's more effective for parsing raw data than for constructing models directly. So, it looks like our current approach is the best we have right now.

We should revisit that ...

Totally agree! For the time being, this approach should work fine. We may need to revisit it when we need to deprecate some fields or introduce V3 metadata.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amogh-jahagirdar for fixing this over the weekend, and @HonahX for jumping in for the review 🙌

@Fokko Fokko merged commit dab5d76 into apache:main Feb 12, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants